Stable Diffusion WebUI Critical CUDA Out of Memory Detection Rule #134

MAVRICK-1 · 2025-08-27T08:59:39Z

🎯 Overview

This PR introduces a detection rule for Stable Diffusion WebUI CUDA out-of-memory (OOM) failures, one of the most critical and widespread issues affecting AUTOMATIC1111 Stable Diffusion deployments. The rule identifies CUDA memory exhaustion that leads to complete WebUI service failure and requires manual intervention.

CRE Playground Links

CRE-2025-0130 Playground: Test Rule

🚨 Problem Statement

High-Severity Issue: Stable Diffusion WebUI CUDA failures cause:

Complete service interruption - WebUI becomes unresponsive and requires manual restart
Loss of current image generation progress and any queued generation tasks
Potential CUDA context corruption requiring process restart to recover
User experience degradation with failed image generations and error messages
System instability in multi-user deployments where one user's OOM affects others
Cascading failures where recovery attempts also fail due to memory constraints

Why This Matters: Stable Diffusion CUDA failures are particularly dangerous because:

High-resolution image generation (1024x1024+) requires massive GPU VRAM
Failures often occur mid-generation causing complete data loss
AUTOMATIC1111 WebUI has millions of users globally
Issues manifest as generic crashes making diagnosis difficult
Memory fragmentation prevents allocation of required contiguous memory blocks
Restoring service requires immediate manual intervention

Rule Performance

Detection Rate: 2 critical hits with sequence matching
Processing Speed: 64.52K lines/s
Window: 30-second detection window captures failure cascade
False Positive Rate: Low (specific PyTorch CUDA error patterns)
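
The idea of sequence matching inside a 30-second window can be pictured with a minimal Python sketch. This is not the rule's YAML and not how preq evaluates it: the function name `find_cascade`, the `(timestamp, message)` event shape, and the choice of these two messages as the ordered sequence are assumptions made for illustration. Only the message strings themselves are taken from the error patterns listed in this PR.

```python
from datetime import datetime, timedelta

# Hypothetical ordered sequence. The strings come from this PR's error
# patterns, but the real rule's sequence may differ.
SEQUENCE = [
    "CUDA out of memory",
    "Recovery failed - WebUI requires restart",
]
WINDOW = timedelta(seconds=30)


def find_cascade(events, sequence=SEQUENCE, window=WINDOW):
    """Return timestamps of the first in-order match of `sequence` that
    completes within `window` of its first event, or None if there is none.

    `events` is an iterable of (datetime, message) pairs sorted by time.
    """
    events = list(events)
    for i, (start, msg) in enumerate(events):
        if sequence[0] not in msg:
            continue
        matched = [start]
        for ts, later in events[i + 1:]:
            # Stop once the sequence is complete or the window has elapsed.
            if len(matched) == len(sequence) or ts - start > window:
                break
            if sequence[len(matched)] in later:
                matched.append(ts)
        if len(matched) == len(sequence):
            return matched
    return None


# Usage: an OOM error followed 12 seconds later by a failed recovery.
t0 = datetime(2025, 8, 27, 8, 0, 0)
events = [
    (t0, "torch.cuda.OutOfMemoryError: CUDA out of memory"),
    (t0 + timedelta(seconds=12), "Recovery failed - WebUI requires restart"),
]
print(find_cascade(events))
```

Anchoring the window at the first event of the sequence keeps an isolated OOM error, or a recovery failure that arrives much later, from counting as a cascade.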

📊 Stable Diffusion Issues Covered

#	Issue Type	Example Error Pattern
1	CUDA Memory Exhaustion	`torch.cuda.OutOfMemoryError: CUDA out of memory`
2	Model Loading Failures	Failed to allocate tensor on device
3	Generation Process Crashes	Fatal error during image generation
4	WebUI Unresponsiveness	Gradio interface becoming unresponsive
5	Recovery Failures	Recovery failed - WebUI requires restart
6	CUDA Context Corruption	CUDA context may be corrupted
7	Complete Service Failure	Complete service failure - manual intervention required
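
As a rough illustration of how the table maps log lines to issue types, the following Python sketch tags each line by plain substring match. Substring matching, the `PATTERNS` dictionary, and the helper names `classify` and `count_issues` are assumptions for this example. The rule file may match these patterns differently.

```python
from collections import Counter

# Issue types and patterns copied from the table above. Matching by plain
# substring is an assumption, not necessarily what the rule does.
PATTERNS = {
    "CUDA Memory Exhaustion": "torch.cuda.OutOfMemoryError: CUDA out of memory",
    "Model Loading Failures": "Failed to allocate tensor on device",
    "Generation Process Crashes": "Fatal error during image generation",
    "WebUI Unresponsiveness": "Gradio interface becoming unresponsive",
    "Recovery Failures": "Recovery failed - WebUI requires restart",
    "CUDA Context Corruption": "CUDA context may be corrupted",
    "Complete Service Failure": "Complete service failure - manual intervention required",
}


def classify(line):
    """Return the first issue type whose pattern occurs in `line`, else None."""
    for issue, needle in PATTERNS.items():
        if needle in line:
            return issue
    return None


def count_issues(lines):
    """Count how many lines fall under each issue type, skipping unmatched lines."""
    return Counter(issue for issue in map(classify, lines) if issue is not None)


# Usage with a few hand-written sample lines (not taken from the demo logs).
sample = [
    "ERROR torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate",
    "ERROR CUDA context may be corrupted",
    "INFO Model loaded",
]
print(count_issues(sample))
```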

🧪 Testing & Validation

CRE Rule Testing

cd stable-diffusion-demo
cat logs/sd-webui-cuda-oom.log | preq -r ../rules/cre-2025-0130/stable-diffusion-cuda-oom.yaml -d

Test Results:

🎬 Demo Environment

Repo link (private invitation already sent): https://github.com/MAVRICK-1/cuda-oom

Screencast.from.2025-08-27.13-19-35.mp4

./start-demo.sh
cat logs/roop-cuda-oom.log | preq -r stable-diffusion-cuda-oom.yaml -d

Fixes #130
/claim #130

Add rules for handling CUDA out of memory errors and log generation failures

0827db1

algora-pbc bot added the 🙋 Bounty claim label Aug 27, 2025

algora-pbc bot mentioned this pull request Aug 27, 2025

Stable Diffusion Web UI: Reproduce A High-Severity Failure & Write a CRE Rule [Multiple Winners] [Submit by August 31 11:59 pm ET] #130

Closed

MAVRICK-1 added 2 commits September 3, 2025 20:37

Merge branch 'main' into feat/cre-2025-0130-roop-cuda-oom

e6a2773

Merge branch 'main' into feat/cre-2025-0130-roop-cuda-oom

88481b5
